Abstract

Salary data for Iowa State University (ISU) employees is publicly available through the state government. CyChecks, an R package, was developed with the intent of examining and interpreting the university’s salary dataset. This package will allow users to compare pay between genders, positions, and departments, while displaying the results in a dynamic interface (shiny app). While this data is publicly available, it was also understood that salaries are sensitive, personal information and therefore an individual’s name that could easily be associated with their salary was anonymized. We hope Cychecks will provide useful fodder for discussing pay patterns within a publicly funded university.

Introduction

The goal of CyChecks is to allow users easy access to the publicly available data concerning Iowa State University (the Cyclones) employee salaries. #### Some of the services this access could offer include:

  1. Aiding job seekers in negotiating starting salaries.
  2. Shedding light on possible pay inequities with respect to gender.
  3. Identifying the highest paid positions at the university.

Literature Review

Pay patterns and/or pay disparities have been an on-going topic of discussion and focus for research since [insert date]. The [insert law] aimed to dismantle these disparities but unfortunately the female gender is still under xx.

Data Description

The state of Iowa offers a large amount of public data at this site. You can access the data by signing up for an API token here. Iowa state employee salaries are available at this site. Using an API token, CyChecks provides a function to easily get data for any given year. The data from this site does not include the employee’s home department. Unfortunately, after much exploring, the data linking names to departments is not easily accessible. Iowa State Univerisity’s Human Resources Department kindly provided a list of employees with their home departments and associated colleges valid as of January 1st 2019. Since acquiring this information is not reproducible to the average user, included is a full dataset (with names anonymized) of all salary info (from years 2007 to 2018) cross-referenced by department in the CyChecks package. It was discovered that employees that had left before the year 2019 will not listed under their respective department in this dataset. This isn’t ideal, but a reality of trying to display this sort of data.

The following displays parts of the dataset more explicitly.

Below you will see the structure of the data, which displays information about the size of the dataset (number of observations), the number of variables (categories), and the type of data that variable provides.

library(CyChecks)
data("sals_dept_profs")
#load(file = "./data/sals_dept_profs.rda")
#view(sals_dept_profs)
cychecks_dept_data <- sals_dept_profs
str(cychecks_dept_data)
#> 'data.frame':    16633 obs. of  12 variables:
#>  $ base_salary_date   : POSIXct, format: "2007-07-01" "2007-07-01" ...
#>  $ fiscal_year        : Factor w/ 12 levels "2007","2008",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ gender             : Factor w/ 3 levels "M","F","*": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ place_of_residence : Factor w/ 963 levels "STORY","COOK",..: 1 1 1 1 7 1 1 1 1 1 ...
#>  $ position           : chr  "professor" "professor" "professor" "professor" ...
#>  $ total_salary_paid  : num  82467 81893 102340 74169 120006 ...
#>  $ travel_subsistence : num  4429 700 NA NA 6315 ...
#>  $ base_salary        : chr  "84597" "71281" "94971" "74911" ...
#>  $ organization       : Factor w/ 18 levels "AMES LABORATORY OF US DOE",..: NA NA NA NA NA 7 7 7 3 2 ...
#>  $ department         : Factor w/ 233 levels "AMES LABORATORY",..: NA NA NA NA NA 86 97 99 36 2 ...
#>  $ id                 : chr  "94e35beb" "aa677a76" "ff1d3cf6" "b2540156" ...
#>  $ position_simplified: chr  "professor" "professor" "associate" "professor" ...

listed below are the first 6 rows of the dataset just to give you an idea of how it looks

head(cychecks_dept_data)
#>   base_salary_date fiscal_year gender place_of_residence  position
#> 1       2007-07-01        2007      M              STORY professor
#> 2       2007-07-01        2007      M              STORY professor
#> 3       2007-07-01        2007      M              STORY professor
#> 4       2007-07-01        2007      M              STORY professor
#> 5       2007-07-01        2007      M               POLK professor
#> 6       2007-07-01        2007      M              STORY professor
#>   total_salary_paid travel_subsistence base_salary
#> 1           82467.0            4429.17       84597
#> 2           81893.4             700.00       71281
#> 3          102340.3                 NA       94971
#> 4           74169.0                 NA       74911
#> 5          120006.0            6315.17      127306
#> 6           80801.6            5791.86       72465
#>                         organization department       id
#> 1                               <NA>       <NA> 94e35beb
#> 2                               <NA>       <NA> aa677a76
#> 3                               <NA>       <NA> ff1d3cf6
#> 4                               <NA>       <NA> b2540156
#> 5                               <NA>       <NA> 74eb0279
#> 6 COLLEGE OF LIBERAL ARTS & SCIENCES   EEOB-LAS 9d2c3ad7
#>   position_simplified
#> 1           professor
#> 2           professor
#> 3           associate
#> 4           professor
#> 5           professor
#> 6           associate

listed below are all the categorical variables a user can look into

names(cychecks_dept_data)
#>  [1] "base_salary_date"    "fiscal_year"         "gender"             
#>  [4] "place_of_residence"  "position"            "total_salary_paid"  
#>  [7] "travel_subsistence"  "base_salary"         "organization"       
#> [10] "department"          "id"                  "position_simplified"

Package Description & Package Functions

This package (CyChecks) is an R package that helps a user conveniently “check” various information concerning salary information for Iowa State University employees and faculty. CyChecks

CyChecks currently includes several functions that help the user navigate and explore this data more easliy, while also 1. A function to: 1. download data directly from the iowa.gov website 2. anonymize names of downloaded data 3. simplify professor and post-doc position titles 4. quickly identify departments with possible gender pay inequities 1. Data from: i. Employee/salary/department dataset for 2007-2018 ii. Employee/salary/department/college dataset assumed to be valid for 2018

Function 1: Web Scrapping (aka sal_df)

-Get data from the web

#insert example of what function executes

Function 2: Anonymize names (aka anonymize)

-convert each individual’s personal name to an alphanumeric id to mask the individual’s real identity

#insert example of what function executes

Function 3:Get professor info (aka get_profs)

-find the pay gap between different positions

#insert example of what function executes

Function 4: Running statistics (aka stats_mf)

Run stats to find significance between male versus female with respect to total salary

#insert example of what function executes

Analysis and Results (Graphs and Tables)

The following graphs display some examples of the pay pattern across departments and over the years for different positons at Iowa State University.

#used packages
library(CyChecks)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyverse)
#> -- Attaching packages ---------------------------------------------------------- tidyverse 1.2.1 --
#> v ggplot2 3.1.0     v readr   1.3.1
#> v tibble  2.1.1     v purrr   0.3.2
#> v tidyr   0.8.3     v stringr 1.4.0
#> v ggplot2 3.1.0     v forcats 0.4.0
#> -- Conflicts ------------------------------------------------------------- tidyverse_conflicts() --
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag()    masks stats::lag()
library(ggplot2)

###Load data
data("sals_dept.rda")
#> Warning in data("sals_dept.rda"): data set 'sals_dept.rda' not found
#load(file = "C:/Users/lawto/Desktop/Spring2019/STAT585/finalproj/CyChecks/CyChecks/data/sals_dept.rda")
#view(sals_dept)
cychecks_data <- sals_dept
#view(cychecks_data)
#load(file = "C:/Users/lawto/Desktop/Spring2019/STAT585/finalproj/CyChecks/CyChecks/data/sals_dept_profs.rda")
#view(sals_dept_profs)
cychecks_dept_data <- sals_dept_profs
#view(cychecks_dept_data)

#clean data
#sals_dept <- sals_dept %>% filter(!is.na(gender), gender != "*")
# sals_dept_profs <- sals_dept_profs %>% filter(!is.na(gender), gender != "*")
# department <- c("All departments", sort(unique(as.character(sals_dept$department))))
# fiscal_year <- c("All years", sort(unique(as.character(sals_dept$fiscal_year))))

#plot data to analyze
gender_pay <- cychecks_data %>% 
  ggplot(aes(x = fiscal_year, y = total_salary_paid)) +
  geom_point(aes(col=gender)) +
  labs(subtitle = "Gender Vs Total Salary Paid Each Year",
       y="Total Salary Paid",
       x="Year")
plotly::ggplotly(gender_pay) #interactive map


gender_pay2 <- cychecks_data %>%
  ggplot(aes(x = total_salary_paid)) +
  geom_bar(aes(col=gender)) +
   xlim(c(0, 2020)) + 
  ylim(c(0, 100)) + 
  labs(subtitle = "Gender Vs Total Salary Paid Each Year",
       y="Total Salary Paid",
       x="Year")
plotly::ggplotly(gender_pay2)
#> Warning: Removed 103737 rows containing non-finite values (stat_count).
#> Warning: position_stack requires non-overlapping x intervals
#plot(gender_pay2)

Conclusion

Limitations & Future Work

References/Bibliography

Packaage Website/Vignette

Follow the link